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1. Motivation 

Although they have desirable theoretical properties, Ginsparg-Wilson (GW) [|l]] fermion for- 
mulations are both computationally and conceptually demanding when it comes to simulations 
including effects of dynamical quarks (see eg. ||^, |3|, ^ One way to circumvent this problem is 
the mixed action approach: the ensemble production is performed with a relatively cheap fermion 
discretization and GW fermions are only used for calculations in the valence sector. This allows to 
speed up the simulation while retaining most advantages of the GW fermions, such as their exact 
chiral symmetry and the resulting absence of complicated operator mixings in valence calculations. 
However, because valence and sea quarks are different, the theory suffers from unitarity violations. 
As a result, the chiral perturbation theory formulae needed to extrapolate lattice data in quark mass 
and lattice spacing become more involved and have more free parameters. 



2. Action and algorithm details 
2.1 Choice of action 

To generate the configurations we use the Luscher-Weisz gauge action and stout-smeared 
[^] 0(a) -improved [^] Wilson fermions, where the improvement coefficient csw is taken at tree 
level. The combination of link-fattening and 0{a) improvement greatly reduces chiral symmetry 
breaking effects [^. This action has an improved scaling behavior where the theoretically leading 
0{asa) contributions will, in practice, be negligible and the scaling to the continuum looks as if 
the theory had only 0{a^) cut-off effects [l^]. We use 2, 3 and 6 levels of stouting and several 
j8 -values in order to check the scaling behavior explicitly. The smearing will also significantly 
reduce the appearance of small eigenvalues related to short distance artifacts. This will improve 



the convergence rate of the solvers, as discussed in section 2.3. In the valence sector we use overlap 



fermions | |11| ] with "UV filtering" to improve the locality without altering p = 1 012[]. 



2.2 Choice of parameters 

Setting the strange sea quark mass: The approximate determination of the strange mass is 
done hy Nf = 3 simulations: at a given jS we search for the quark mass where the relation 



mps/mv = \j2m\-m\/m^ (2.1) 

is satisfied. We have determined the j3 dependence of this approximate strange mass in a fairly 
large range (j8 =2.9 — 3.8) and it turns out to be smooth. 

Matching sea and valence quarks: The different discretizations in the sea and valence sectors 
lead to discretization error induced unitarity violating effects. As far as low-energy properties are 
concerned, these effects, as well as those associated with any mismatch between sea and valence 
quark masses, can be accounted for with the appropriate version of mixed action partially quenched 
chiral perturbation theory (MAPQXPT), as described in the accompanying proceedings contribu- 
tion by L. Lellouch [|l^]. Accordingly, a precise matching of the sea and valence sector is not 
needed. Still, in order to remain relatively close to the unitary situation, we have one set of valence 
data where the pion and kaon masses of the two sectors are approximately matched. 

Overview of simulation points: A brief summary of our ensembles is given in Tab.[T[ 
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a [fm] 


Mjc [MeV] 




T 


overlap inversion 


3.3 


0.136 


360 


163 


64 


DONE 






310 


243 


64 


DONE 






250 


243 


64 


DONE 


3.57 


0.088 


570 


243 


64 


DONE 






490 


243 


64 


DONE 






410 


243 


64 


DONE 






300 


323 


64 


DONE 






190 


483 


64 


DONE 


3.7 


0.069 


520 


323 


96 


DONE 






400 


323 


96 


RUNNING 






290 


403 


96 


DONE 



Table 1: Nf = 2+1 simulation points. The last column indicates the status of the overlap inversions. 



2.3 Dynamical fermion algorithm 

We aim to run simulations with 2+1 flavors, with pion masses approaching the physical point. 
To simulate the two light flavours we use the Hybrid Monte Carlo (HMC) [|l^] algorithm with 
even/odd preconditioned [ [T^ ] clover fermions. However, in the regime of light quark masses the 
standard HMC suffers from "critical slowing down", that is on top of the increased computational 
cost per trajectory, the autocorrelation times grow significantly. Several improvements over the 
standard version have been proposed, many of which can be combined. We use the following ones: 



multiple time-scale integration ("Sexton-Weingarten integration scheme") [16] to be able to 
run the computationally most demanding part of the simulation (the inversion of the light 
fermion matrix) at a larger time-step then the comparatively less costly part. 



mass preconditioning ("Hasenbusch trick") [|17[], to reduce jumps in the fermionic force and 



Omelyan integrator [18] to reduce the energy violations during the Molecular Dynamics 



(MD) part of the HMC. 

The strange quark is included via the RHMC algorithm. This method is exact and highly efficient 



when combined with the Sexton-Weingarten integration scheme [19]. 



2.4 Mixed precision solver 

The most time consuming part, both in valence and see sector calculations, is the (incomplete) 
fermion matrix inversion by means of a solver. In order to maintain reversibility, the MD part of the 
HMC algorithm has to be performed in double precision. The same holds true for propagator cal- 
culations at small quark masses, due to the large condition number of the fermion matrix involved. 
However, this does not imply that each fermion matrix multiplication needs to be done in double 
precision. In the valence sector we wish to solve 

Dx = b (2.2) 
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Figure 1: Performance of CG-64 (double precision, red squares) and CG-mix (mixed precision, blue cir- 
cles) during the fat clover MD trajectory (left). Chirally projected CG (double precision) versus relaxed 
GMRESR(SUMR-32) (mixed precision) during overlap propagator calculations (right). 
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Figure 2: Strong scaling analysis of the Wilson fermion matrix multiplication on a global 48^ X 96 lattice. 
Left: performance of the single and double precision kernel in percentage of machine peak, as well as their 
memory footprint for 1024 up to 8192 nodes (2048 to 16384 CPUs). The gray shaded area indicates the size 
of the L3 cache. Right: the same scaling analysis and kernel performance in TFlops. 



with D the overlap or clover operator to construct the correlator. In the sea sector we wish to solve 

D^Dx = b (2.3) 

for the clover action to calculate the fermionic force within the MD. To accelerate the solvers it is, 
in either case, possible to use a single precision version of D within a mixed precision solver. We 
find that there is basically no penalty in terms of the iteration count; the increase of the number of 
forward multiplications is well below 10%. 

How a single precision calculation can be used to accelerate a solver is most transparent in 
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Figure 3: AH (left), CG count (center) and plaquette (right) for the fat clover action at m^c = 190 MeV. 



the valence sector where we use the relaxed GMRESR algorithm with a recursive SUMR [Q] 
preconditioning. In this scheme, the SUMR is merely used to calculate a low precision inversion. 
Therefore, the SUMR can be coded in single precision (except for global sums), even if the GM- 
RESR requires double precision accuracy. Since almost all matrix multiplications are performed 
within the SUMR, the whole solver is dominated by the single precision matrix multiplication 
performance, resulting in a significant speedup (see Fig. |l]). The gain is due to three effects: 

• On a generic computer architecture the peak performance of the single precision solver will 
be larger than that of the double precision version. 

• The performance of the solver is usually bound by the bandwidth to the system memory. 
Thus, on the same architecture, usually twice as many single precision than double precision 
numbers can be loaded from system memory per unit time. 

• The single precision matrix vector multiplication routine requires half the memory of the 
double precision version. The inverter will fit into the cache for larger local lattice sizes and 
the range in which the algorithm scales with the number of nodes will improve (see Fig.^). 

2.5 Phase structure and algorithm stabihty 

During the HMC evolution a number of observables is monitored to detect any potential in- 
stability of the algorithm. In Fig.^ the energy violation AH, the CG iteration number needed to 
reach the residue tolerance and the plaquette are shown for the run with 190 MeV pion mass. After 
the thermalization a stable distribution can be seen. When attempting a "thermal cycle" in the pion 
mass one finds no signs of a hysteresis. Moreover, "cold" and "hot" starts quickly lead to the same 
plaquette (see Fig.^. Altogether, it seems that we are far from any potential bulk phase transition. 

2.6 Overall performance and strong scaling analysis 

Several improved versions of the HMC algorithm can be found in the literature. As shown in 



the left panel of Fig. g, our algorithm is compatible with the performance reported in ||20|, [21| ]. 

The right panel shows the good scaling properties of our HMC variety with the number of 
CPUs of the Blue Gene/L of the Jiilich Supercomputing Centre (JSC). Within errors, the curve is 
perfectly linear, very much as for the Wilson kernel shown in Fig ^. The bottom line is that we can 
simulate dynamical quarks in a regime well below 250 MeV. 
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Figure 4: Thermalized plaquette versus rriji (left) and cold versus hot start plaquette at /3 = 3.5 (right). 
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Figure 5: Left: updated "Berlin Wall" plot - full circles correspond to the pre-2005 dynamical Wilson 
simulations, solid squares represent the algorithm of [|TJ, open squares are this work. Right; Scaling of our 
algorithm with the number of CPUs. 



3. Outlook 

To illustrate the statistical quality of our results, Fig. ^ presents effective mass plateaus of 
charged pions, kaons and non-singlet pseudoscalar mesons, for our lightest sea pion at j3 = 3.57 
(cf. Tab. U) - both with clover (left) and overlap (right) valence quarks. 

The performance and stability of our simulations, the statistical accuracy of our results, as well 
as our growing understanding of their chiral behavior [^] are very promising and we look forward 
to presenting phenomenological results, with controlled extrapolations to the physical point of 2+1 
flavor QCD, in future publications. 
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Figure 6: Mass plateaus for clover (left) and overlap (right) pions, kaons and etas at ~ 190MeV. 
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